This is part of my current paper on political reporting in German online news. To measure the ideological content of several major online news services, I compare the topics discussed in these media with the press releases of the Bundestag parties, using a structural topic model.
The following is an analysis of the press releases scraped from the public websites of the parties and their parliamentary groups (Fraktionen). Large parts of this analysis are inspired by the work of Julia Silge and David Robinson (Text Mining with R: A Tidy Approach).
I assume that parties use their press releases to promote their issues and positions and thereby also contribute to the election campaign. Note, however, that there is a difference between the press releases of the parties and those of the parliamentary groups: parties are financed by membership dues, donations, and the state reimbursement of campaign expenses, while parliamentary groups are financed by state funds. According to §25(2) of the Parteiengesetz, state-funded parliamentary groups may not support parties from these funds, since parties not represented in the Bundestag would otherwise be at a practical disadvantage.
Since the line between parliamentary-group activity and election campaign assistance is difficult to draw, I assume that parliamentary groups also intervene in the public perception of their party through their press releases, which is why I include the press releases of both the federal party and its Bundestag group.
An example press release (row 668 of the corpus) before and after cleaning:
| | title_text | text_cleaned |
|---|---|---|
| 668 | Stephan Brandner: Extremismusbekämpfungswahnsinn – gefährlicher Linksextremismus wird ignoriert. Merkel selbst profitiert von militanten Linken und Antifa . Februar 2018. Der Kampf gegen den Linksextremismus ist der Bundesregierung im vergangenen Jahr 2017 gerade einmal 1,5 Millionen Euro wert gewesen. Dies entspricht ca. 1%, also etwa einem Hundertstel, der Mittel, die für den „Kampf gegen rechts“ im selben Jahr aufgebracht wurden, wie die Antwort der Bundesregierung auf eine schriftliche Frage des Abgeordneten Stephan Brandner (AfD) ergab.Brandner macht deutlich, dass eine derartige Schieflage bei der Bekämpfung des politischen Extremismus extrem gefährlich ist und von politischer Einäugigkeit der Bundesregierung zeugt:„Wie kann es sein, dass die Bundesregierung mit so ungleichem Maß misst? Jeder Extremismus ist eine Gefahr für die Demokratie. Egal von welcher Seite er kommt. Die Merkel-Regierung ist auf dem linken Auge blind und verkennt die Gefahren. Merkel selbst profitiert ja auch von den militanten Linken und der Antifa, etwa, wenn diese Terrortruppen regierungskritische Veranstaltungen sabotieren und verhindern. Die Linksextremisten sind auch längst paramilitärisch organisiert, haben terroristische Strukturen gebildet und waren mit ihrem ‚Marsch durch die Institutionen‘ erfolgreich. Beispielsweise bezeichnet sich ja der Thüringer Staatskanzleichef Benjamin Hoff von den Linken als stolzer Linksextremist. Mit alledem muss Schluss sein. 
Linke Staatsfeinde müssen mit aller Kraft bekämpft werden!“ | stephan brandner extremismusbekämpfungswahnsinn gefährlicher linksextremismus ignoriert merkel profitiert militanten linken antifa kampf linksextremismus bundesregierung vergangenen jahr millionen euro wert entspricht hundertstel mittel kampf jahr aufgebracht antwort bundesregierung schriftliche frage abgeordneten stephan brandner ergab brandner deutlich schieflage bekämpfung politischen extremismus extrem gefährlich politischer einäugigkeit bundesregierung zeugt bundesregierung maß misst extremismus gefahr demokratie egal merkel regierung linken auge blind verkennt gefahren merkel profitiert militanten linken antifa terrortruppen regierungskritische veranstaltungen sabotieren verhindern linksextremisten paramilitärisch organisiert terroristische strukturen gebildet marsch institutionen erfolgreich bezeichnet thüringer staatskanzleichef benjamin hoff linken stolzer linksextremist alledem schluss staatsfeinde kraft bekämpft |
# tokenize the cleaned press-release texts into single words
tokens <- pressReleases %>% unnest_tokens(word, text_cleaned1)
# word counts per party, with tf, idf and tf-idf attached
tokens.count <- tokens %>%
count(party, word, sort = TRUE) %>%
bind_tf_idf(word, party, n)
tokens.count %>%
arrange(desc(tf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(party) %>%
top_n(15, tf) %>%
ungroup() %>%
ggplot(aes(word, tf, fill = party)) +
geom_col(show.legend = FALSE, fill = "darkslategray4", alpha = 0.9) +
labs(x = NULL, y = "Term Frequency") +
facet_wrap(~party, ncol = 3, scales = "free") +
coord_flip()
Next, compare word frequencies across parties. Each panel plots a word's proportion in one party's press releases against its proportion for a reference party (first: the CDU). Words close to the dashed diagonal (the line of equal proportions) are used with similar frequency by both parties, so the closer the word cloud sits to this line, the more similar the two parties' vocabularies; empty space in the low-frequency region indicates less similarity between the two parties.
frequency <- tokens.count %>%
group_by(party) %>%
mutate(proportion = n/sum(n)) %>%
select(party, word, proportion) %>%
spread(party, proportion)
frequency %>%
gather(party, proportion, -word, -CDU) %>%
ggplot(aes(x = proportion, y = `CDU`, color = abs(`CDU` - proportion))) +
geom_abline(color = "gray40", lty = 2) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
facet_wrap(~party, nrow = 2) +
theme(legend.position="none") +
labs(y = "CDU", x = NULL)
#ggsave("../figs/word_freq_CDU.png", width = 15, height = 10)
The comparison against each of the other parties repeats the code above verbatim, so it is cleaner to wrap it in a small helper function:
plot_word_freq <- function(comparison_party) {
frequency %>%
gather(party, proportion, -word, -all_of(comparison_party)) %>%
ggplot(aes(x = proportion, y = .data[[comparison_party]],
color = abs(.data[[comparison_party]] - proportion))) +
geom_abline(color = "gray40", lty = 2) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
facet_wrap(~party, nrow = 2) +
theme(legend.position = "none") +
labs(y = comparison_party, x = NULL)
}
plot_word_freq("SPD")
#ggsave("../figs/word_freq_SPD.png", width = 15, height = 10)
plot_word_freq("FDP")
#ggsave("../figs/word_freq_FDP.png", width = 15, height = 10)
plot_word_freq("B90/GRÜNE")
#ggsave("../figs/word_freq_GRUENE.png", width = 15, height = 10)
plot_word_freq("DIE LINKE")
#ggsave("../figs/word_freq_LINKE.png", width = 15, height = 10)
plot_word_freq("AfD")
#ggsave("../figs/word_freq_AfD.png", width = 15, height = 10)
The statistic tf-idf (term frequency - inverse document frequency) is intended to measure how important a word is to a document in a collection (or corpus) of documents. In this case we measure how important a word is to a party (within all the press releases of that party) in the collection of all parties (and their press releases).
The inverse document frequency of a term is defined as the natural logarithm of the inverse share of documents containing it:
\[ idf\text{(term)}=\ln\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right) \]
In this case, \(n_{\text{documents}} = 6\), as we have six parties.
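As a quick sanity check, the idf values appearing in the tables below follow directly from this formula (tidytext's bind_tf_idf uses the natural logarithm):

```r
# idf = ln(n_documents / n_documents containing the term), with 6 parties
n_parties <- 6
idf <- function(n_containing) log(n_parties / n_containing)

idf(1)  # term used by a single party -> 1.79
idf(2)  # term used by two parties    -> 1.10
idf(6)  # term used by all parties    -> 0
```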
Terms with low tf-idf:
tokens.count %>%
arrange(tf_idf)
## # A tibble: 55,124 x 6
## party word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 DIE LINKE bundesregierung 566 0.0109 0 0
## 2 AfD deutschland 470 0.0128 0 0
## 3 DIE LINKE deutschland 330 0.00637 0 0
## 4 DIE LINKE eu 324 0.00625 0 0
## 5 DIE LINKE menschen 286 0.00552 0 0
## 6 AfD deutschen 235 0.00641 0 0
## 7 FDP deutschland 226 0.0109 0 0
## 8 DIE LINKE endlich 219 0.00423 0 0
## 9 AfD eu 208 0.00567 0 0
## 10 DIE LINKE vorsitzend 208 0.00401 0 0
## # ... with 55,114 more rows
An idf of 0 (and thus a tf-idf of 0) indicates that a term appears in the press releases of all six parties: with \(n_{\text{documents containing term}} = 6\), the idf is \(\ln(6/6) = 0\). The inverse document frequency thus down-weights exactly those terms that occur across the whole collection and carry little party-specific information.
Terms with high tf-idf:
tokens.count %>%
arrange(desc(tf_idf))
## # A tibble: 55,124 x 6
## party word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 AfD weidel 185 0.00505 1.79 0.00904
## 2 FDP beer 59 0.00286 1.79 0.00512
## 3 FDP nicola 58 0.00281 1.79 0.00503
## 4 AfD pazderski 98 0.00267 1.79 0.00479
## 5 AfD alic 145 0.00395 1.10 0.00434
## 6 FDP lambsdorff 42 0.00203 1.79 0.00364
## 7 DIE LINKE dagdelen 102 0.00197 1.79 0.00353
## 8 FDP präsidiumsmitgli 57 0.00276 1.10 0.00303
## 9 AfD brandner 62 0.00169 1.79 0.00303
## 10 FDP generalsekretärin 54 0.00261 1.10 0.00287
## # ... with 55,114 more rows
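Each tf_idf value is simply the product of the tf and idf columns; the top row can be checked by hand:

```r
# "weidel" occurs only in AfD releases, so its idf is ln(6/1)
tf_weidel <- 0.00505
tf_weidel * log(6 / 1)  # about 0.0090, consistent with the tf_idf column
                        # (0.00904; the small gap comes from the rounded tf)
```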
tokens.count %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(party) %>%
top_n(15, tf_idf) %>%
ungroup() %>%
ggplot(aes(word, tf_idf, fill = party)) +
geom_col(show.legend = FALSE, fill = "darkslategray4", alpha = 0.9) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~party, ncol = 3, scales = "free") +
coord_flip()
#ggsave("../figs/tf-idf.png", width = 11, height = 6)
Words can be considered not only as single units but also in relation to one another. N-grams, for example, help to examine which words tend to follow others immediately: we tokenize the text into consecutive sequences of n words, and by counting how often word X is followed by word Y we can build a model of the relationships between them.
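As a minimal illustration with a made-up sentence (assuming dplyr and tidytext are available), bigram tokenization turns every pair of consecutive words into one token:

```r
library(dplyr)
library(tidytext)

# made-up example sentence, tokenized into bigrams
tibble(text = "jeder extremismus ist eine gefahr") %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
# -> "jeder extremismus", "extremismus ist", "ist eine", "eine gefahr"
```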
bigrams <- pressReleases %>% unnest_tokens(bigram, text_cleaned1, token="ngrams", n=2)
bigrams.count <- bigrams %>%
count(party, bigram, sort = TRUE) %>%
ungroup() %>%
bind_tf_idf(bigram,party,n)
bigrams.count %>%
arrange(desc(tf_idf)) %>%
group_by(party) %>%
top_n(15, tf_idf) %>%
ungroup() %>%
ggplot(aes(reorder(bigram, tf_idf), tf_idf, fill = party)) +
geom_col(show.legend = FALSE, fill = "darkslategray4", alpha = 0.9) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~party, ncol = 3, scales = "free") +
coord_flip()
#ggsave("../figs/tf-idf_bigram.png", width = 11, height = 6)
trigrams <- pressReleases %>% unnest_tokens(trigram, text_cleaned1, token="ngrams", n=3)
trigrams.count <- trigrams %>%
count(party, trigram, sort = TRUE) %>%
ungroup() %>%
bind_tf_idf(trigram,party,n)
trigrams.count %>%
arrange(desc(tf_idf)) %>%
group_by(party) %>%
top_n(15, tf_idf) %>%
ungroup() %>%
ggplot(aes(reorder(trigram, tf_idf), tf_idf, fill = party)) +
geom_col(show.legend = FALSE, fill = "darkslategray4", alpha = 0.9) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~party, ncol = 3, scales = "free") +
coord_flip()
#ggsave("../figs/tf-idf_trigram.png", width = 12, height = 6)